The Most Important R Packages for Beginners

These R packages will get you started with data manipulation and visualization as a beginner.


HOW TO INSTALL A PACKAGE

Use the function **install.packages("Name_of_package")** in the console to install a package (you only need to do this once).

Load the package with library(Name_of_package) at the beginning of your script so its functions are available.
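
The two steps above can be combined so that a package is installed only when it is missing and then loaded. This is a minimal sketch: the helper name load_or_install is hypothetical (it is not part of R), and "stats" is used only because it ships with every R installation, so the example runs without downloading anything. Swap in "ggplot2" or any other package name.

```r
# Hypothetical helper (not part of R): install a package if needed, then load it.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # one-time download from CRAN
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so the install branch is skipped here
load_or_install("stats")
```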


DATA MANIPULATION

dplyr

dplyr makes it easy to manipulate data frames with a small set of simple functions!

Some functions include:

  • mutate() adds new variables that are functions of existing variables

  • select() picks columns based on their names

  • filter() picks rows based on their values

  • summarise() reduces multiple rows to a single summary

  • arrange() changes row ordering
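
As a rough illustration of the functions listed above, here is a sketch that applies each one to the built-in cars dataset. The variable names (fast, dist_only, and so on) are made up for this example.

```r
library(dplyr)

# Keep only the rows where the recorded speed is above 20
fast <- filter(cars, speed > 20)

# Keep only the dist column
dist_only <- select(cars, dist)

# Sort rows from the longest stopping distance to the shortest
by_dist <- arrange(cars, desc(dist))

# Add a column: stopping distance per unit of speed
with_ratio <- mutate(cars, ratio = dist / speed)

# Collapse all rows into one row of summary statistics
overview <- summarise(cars, mean_speed = mean(speed), max_dist = max(dist))

print(overview)
```

Each function takes a data frame as its first argument and returns a new data frame, so the original cars data is never modified.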

See documentation here



tidyr

tidyr helps you reshape your data into a tidy, consistent layout!

Some functions include:

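  • gather() takes multiple columns and gathers them into key-value pairs, turning a wide table into a long one

  • spread() takes a key column and a value column and spreads them into multiple columns, turning a long table into a wide one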
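
To make the two directions concrete, here is a small sketch that gathers a wide table into a long one and then spreads it back. The data frame and its column names (drug_a, drug_b) are invented for illustration.

```r
library(tidyr)

# A small wide table: one row per ID, one column per drug (made-up values)
wide <- data.frame(ID = c(1, 2),
                   drug_a = c(0.7, -1.6),
                   drug_b = c(1.9, 0.8))

# gather(): the two drug columns become key-value pairs (one row per ID and drug)
long <- gather(wide, key = "drug", value = "extra", drug_a, drug_b)

# spread(): the key column goes back to being separate columns
back <- spread(long, key = drug, value = extra)

print(long)
print(back)
```

Recent tidyr releases mark gather() and spread() as superseded by pivot_longer() and pivot_wider(); the older functions still work.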

See documentation here


GRAPHICS

ggplot2

ggplot2 is a great package for creating beautiful visualizations with very little code!

How to Use:

  • ggplot() is the outer function that sets up your plot
  • aes() is used for choosing axis variables and other plotting info
  • geom_point() and geom_histogram() are used to select the type of graph to create
  • scale_colour_brewer() can create beautiful color schemes
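
The pieces above can be combined into a histogram, sketched here with the built-in iris dataset. The bin width and title are arbitrary choices for this example, and scale_fill_brewer() is used because a histogram's bars are colored through the fill aesthetic rather than colour.

```r
library(ggplot2)

# ggplot() sets up the plot and aes() maps columns to visual properties
p <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  # geom_histogram() picks the chart type; binwidth is in x-axis units
  geom_histogram(binwidth = 0.25) +
  # scale_fill_brewer() is the fill counterpart of scale_colour_brewer()
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Distribution of Petal Length by Iris Species")

print(p)
```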

See documentation here



SYNTAX

magrittr

magrittr lets you chain function calls so that your code reads from left to right.

This package allows you to use piping in your code.

Piping uses the syntax %>%, which takes the value on its left (a variable or the result of a function) and passes it as the first argument of the function on its right.

An example: (2 * 4) %>% sum(3) returns 11, because the result of 2 * 4 becomes the first argument of sum().
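
Here is a short sketch comparing nested calls with the piped equivalent. The variable names are made up for the example.

```r
library(magrittr)

# Without the pipe: read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# With the pipe: read left to right, each result feeds the next function
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# Extra arguments go after the piped value: this is the same as sum(8, 3)
eleven <- (2 * 4) %>% sum(3)

print(c(nested, piped, eleven))  # values: 6, 6, 11
```

Both versions compute the same result; the pipe only changes how the code reads.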

See documentation here


DATASETS

A list of built-in datasets to use for your next project!

  • HairEyeColor: distribution of hair color, eye color, and sex in 592 statistics students.
  • sleep: the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.
  • Orange: records of the growth of orange trees.
  • cars: speed and stopping distances of cars.
  • rivers: lengths (in miles) of 141 “major” rivers in North America.
  • presidents: the (approximately) quarterly approval rating for the President of the United States.
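
These datasets are available as soon as R starts, so a quick way to get familiar with them is to inspect them directly. The following is a small sketch; the variable names n_rivers and n_students are made up for the example.

```r
# Built-in datasets come from the 'datasets' package, which R attaches at startup,
# so no install.packages() or library() call is needed.
print(head(cars, 3))             # first three rows of a data frame
str(sleep)                       # column names and types

n_rivers <- length(rivers)       # rivers is a plain numeric vector
n_students <- sum(HairEyeColor)  # HairEyeColor is a table of counts

print(n_rivers)
print(n_students)
print(class(presidents))         # presidents is stored as a time series
```

Note that dataset names are case-sensitive: sleep and presidents are lowercase, while Orange and HairEyeColor are capitalized.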

Example scripts

Load packages

  library(ggplot2)
  library(dplyr)
  library(tidyr)
  library(magrittr)

dplyr

Adding a column titled time by dividing each value of dist by the corresponding value of speed

  # Show the first five rows of cars, then the same rows with the new time column
  knitr::kable(head(cars, 5))
  knitr::kable(mutate(head(cars, 5), time = round(dist / speed, digits = 2)))

speed  dist
4      2
4      10
7      4
7      22
8      16

speed  dist  time
4      2     0.50
4      10    2.50
7      4     0.57
7      22    3.14
8      16    2.00

tidyr

Rearranging the sleep table so that each row is one ID and each group gets its own column holding the extra hours of sleep

  # Show the original long table, then the wide version with one column per group
  knitr::kable(sleep)
  knitr::kable(sleep %>%
                 spread(group, extra))

extra  group  ID
0.7    1      1
-1.6   1      2
-0.2   1      3
-1.2   1      4
-0.1   1      5
3.4    1      6
3.7    1      7
0.8    1      8
0.0    1      9
2.0    1      10
1.9    2      1
0.8    2      2
1.1    2      3
0.1    2      4
-0.1   2      5
4.4    2      6
5.5    2      7
1.6    2      8
4.6    2      9
3.4    2      10

ID  1     2
1   0.7   1.9
2   -1.6  0.8
3   -0.2  1.1
4   -1.2  0.1
5   -0.1  -0.1
6   3.4   4.4
7   3.7   5.5
8   0.8   1.6
9   0.0   4.6
10  2.0   3.4

ggplot2

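Creating a scatter plot of the iris data

  # Petal length against petal width, colored by species.
  # The plot is stored as p so it does not mask the base plot() function.
  p <- ggplot(iris) +
    geom_point(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
    scale_color_brewer(palette = "Set2") +
    labs(title = "Petal Length vs. Petal Width of Different Iris Species")
  print(p)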